This is the R vignette of the Bulk RNA-Seq Transcriptomics project. In this vignette we will start from a Recount table, whose counts derive from RNA-seq samples of human brain cells compared to pancreas and liver cells (see http://159.149.160.56/GT_files/July30/ and also https://gtexportal.org/home/).
Datasets are in the “Ranged Summarized Experiment” format of Recount2.
library("recount", quietly = TRUE)
Let’s start from the first tissue mentioned: brain. We can load each dataset with the load() command.
data <- load("rse_gene_brain_9_scaled.Rdata")
data
[1] "rse"
The object containing the data is called “rse” and its type is “RangedSummarizedExperiment”.
rse
class: RangedSummarizedExperiment
dim: 58037 10
metadata(0):
assays(1): counts
rownames(58037): ENSG00000000003.14 ENSG00000000005.5 ... ENSG00000283698.1 ENSG00000283699.1
rowData names(3): gene_id bp_length symbol
colnames(10): SRR2166176 SRR2167642 ... SRR1397094 SRR817686
colData names(82): project sample ... title characteristics
Counts have already been normalized/scaled (to 40M reads per column), therefore they do not need:
#rse <- scale_counts(rse)
All row data are identical, they represent the Gencode V25 comprehensive annotation.
We can explore the information associated with the samples with colData, and with rows with rowData.
head(colData(rse), 2)
DataFrame with 2 rows and 82 columns
project sample experiment run read_count_as_reported_by_sra reads_downloaded
<character> <character> <character> <character> <numeric> <numeric>
SRR2166176 SRP012682 SRS1036203 SRX1152700 SRR2166176 194283036 194280472
SRR2167642 SRP012682 SRS1036572 SRX1153642 SRR2167642 178485738 178483520
proportion_of_reads_reported_by_sra_downloaded paired_end sra_misreported_paired_end mapped_read_count auc sharq_beta_tissue
<numeric> <logical> <logical> <integer> <numeric> <logical>
SRR2166176 0.999987 TRUE FALSE 191059974 42413095962 NA
SRR2167642 0.999988 TRUE FALSE 175652255 38996160832 NA
sharq_beta_cell_type biosample_submission_date biosample_publication_date biosample_update_date avg_read_length geo_accession
<logical> <logical> <logical> <logical> <integer> <logical>
SRR2166176 NA NA NA NA 500 NA
SRR2167642 NA NA NA NA 500 NA
bigwig_file sampid smatsscr smcenter smpthnts smrin smts smtsd smubrid
<character> <character> <integer> <character> <character> <numeric> <character> <character> <character>
SRR2166176 SRR2166176_SRS103620.. GTEX-T5JC-0011-R11A-.. NA C1, A1 9.8 Brain Brain - Cerebellar H.. 0002037
SRR2167642 SRR2167642_SRS103657.. GTEX-T6MN-0011-R11A-.. NA C1, A1 8.9 Brain Brain - Cerebellar H.. 0002037
smtspax smtstptref smnabtch smnabtcht smnabtchd smgebtch smafrze smgtc sme2mprt smchmprs smntrart
<integer> <character> <character> <character> <character> <character> <character> <character> <numeric> <integer> <numeric>
SRR2166176 NA Actual Death BP-24011 RNA isolation_PAXgen.. 04/10/2012 LCSET-4990 0.652753 17688 0.937885
SRR2167642 NA Actual Death BP-24029 RNA isolation_PAXgen.. 04/11/2012 LCSET-5404 0.681510 18208 0.936907
smnumgps smmaprt smexncrt sm550nrm smgnsdtc smunmprt sm350nrm smrdlgth smmncpb sme1mmrt smsflgth smestlbs smmppd
<integer> <numeric> <numeric> <numeric> <integer> <numeric> <numeric> <integer> <numeric> <numeric> <integer> <integer> <integer>
SRR2166176 500 0.740389 0.754849 0.296297 24696 0.603455 0.757195 250 198.658 0.00149830 290 68340470 134027135
SRR2167642 455 0.758914 0.732335 0.280716 25082 0.677837 0.835082 250 204.598 0.00148993 275 89331173 126114611
smnterrt smrrnanm smrdttl smvqcfl smmncv smtrscpt smmppdpr smcglgth smgappct smunpdrd smntrnrt smmpunrt smexpeff
<numeric> <integer> <integer> <integer> <numeric> <integer> <integer> <integer> <numeric> <integer> <numeric> <numeric> <numeric>
SRR2166176 0.0604812 2645283 181022622 13260414 0.638870 142012 57059850 44086 0.0251530 0 0.183036 0.446791 0.558882
SRR2167642 0.0615550 2434834 166177644 12308094 0.652932 142650 54698354 48115 0.0250529 0 0.204572 0.514420 0.555780
smmppdun sme2mmrt sme2anti smaltalg sme2snse smmflgth sme1anti smspltrd smbsmmrt sme1snse sme1pcts smrrnart sme1mprt
<integer> <numeric> <integer> <integer> <integer> <integer> <integer> <integer> <numeric> <integer> <numeric> <numeric> <numeric>
SRR2166176 80879363 0.00191255 1486357 5202496 47024056 403 59582880 32214781 0.00168091 1915592 3.11486 0.014613 0.828024
SRR2167642 85485135 0.00178416 1409790 4712790 44993677 401 55158090 29473960 0.00162204 1750181 3.07544 0.014652 0.836319
smnum5cd smdpmprt sme2pcts title characteristics
<integer> <numeric> <numeric> <logical> <CharacterList>
SRR2166176 877 0.396545 96.9360 NA NA
SRR2167642 855 0.322163 96.9619 NA NA
rowData(rse)
DataFrame with 58037 rows and 3 columns
gene_id bp_length symbol
<character> <integer> <CharacterList>
ENSG00000000003.14 ENSG00000000003.14 4535 TSPAN6
ENSG00000000005.5 ENSG00000000005.5 1610 TNMD
ENSG00000000419.12 ENSG00000000419.12 1207 DPM1
ENSG00000000457.13 ENSG00000000457.13 6883 SCYL3
ENSG00000000460.16 ENSG00000000460.16 5967 C1orf112
... ... ... ...
ENSG00000283695.1 ENSG00000283695.1 61 NA
ENSG00000283696.1 ENSG00000283696.1 997 NA
ENSG00000283697.1 ENSG00000283697.1 1184 HSFX3
ENSG00000283698.1 ENSG00000283698.1 940 NA
ENSG00000283699.1 ENSG00000283699.1 60 MIR4481
But also the genomic coordinates of genes.
rowRanges(rse)
GRanges object with 58037 ranges and 3 metadata columns:
seqnames ranges strand | gene_id bp_length symbol
<Rle> <IRanges> <Rle> | <character> <integer> <CharacterList>
ENSG00000000003.14 chrX 100627109-100639991 - | ENSG00000000003.14 4535 TSPAN6
ENSG00000000005.5 chrX 100584802-100599885 + | ENSG00000000005.5 1610 TNMD
ENSG00000000419.12 chr20 50934867-50958555 - | ENSG00000000419.12 1207 DPM1
ENSG00000000457.13 chr1 169849631-169894267 - | ENSG00000000457.13 6883 SCYL3
ENSG00000000460.16 chr1 169662007-169854080 + | ENSG00000000460.16 5967 C1orf112
... ... ... ... . ... ... ...
ENSG00000283695.1 chr19 52865369-52865429 - | ENSG00000283695.1 61 <NA>
ENSG00000283696.1 chr1 161399409-161422424 + | ENSG00000283696.1 997 <NA>
ENSG00000283697.1 chrX 149548210-149549852 - | ENSG00000283697.1 1184 HSFX3
ENSG00000283698.1 chr2 112439312-112469687 - | ENSG00000283698.1 940 <NA>
ENSG00000283699.1 chr10 12653138-12653197 - | ENSG00000283699.1 60 MIR4481
-------
seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
Select the three columns we have to work on: 8 -> 8 9 10. What are their names?
rownames(colData(rse)[colnames(rse)[8:10],])
[1] "SRR1337909" "SRR1397094" "SRR817686"
Total number of reads per column:
c(sum(assays(rse)$counts[,8]), sum(assays(rse)$counts[,9]), sum(assays(rse)$counts[,10]))
[1] 38280901 36625433 37946538
Remove all genes with length < 200, or, in other words, keep all genes with length >= 200…
rse <- rse[, 8:10]
rse <- rse[rowData(rse)$bp_length >= 200,]
and produce the overall read count associated with these genes in each tissue/replicate. Let us extract the count matrix from the rse object.
longerThan200 <- assays(rse)$counts
head(longerThan200)
SRR1337909 SRR1397094 SRR817686
ENSG00000000003.14 485 500 234
ENSG00000000005.5 8 1 1
ENSG00000000419.12 319 584 511
ENSG00000000457.13 297 332 461
ENSG00000000460.16 138 187 412
ENSG00000000938.12 146 241 837
Total number of reads tissue per column after removal:
c(sum(longerThan200[,1]), sum(longerThan200[,2]), sum(longerThan200[,3]))
[1] 38174026 36555701 37888352
Before removing all mitochondrial genes we have to check if these ones begin with MT.
unlist(rowData(rse)$symbol, use.names = FALSE)[grep("^MT-*", unlist(rowData(rse)$symbol, use.names = FALSE))]
[1] "MTMR7" "MTMR11" "MTA3" "MTMR1" "MTHFD2" "MTFR1" "MTIF2" "MTMR2" "MT3" "MTAP" "MTMR3"
[12] "MTHFD1" "MTG2" "MTMR8" "MT4" "MTHFSD" "MTFMT" "MTMR9" "MTPN" "MTPAP" "MTMR4" "MTCH2"
[23] "MTRF1L" "MTR" "MTFR1L" "MTHFD1L" "MTRF1" "MTERF2" "MTIF3" "MTERF4" "MTRR" "MT1G" "MT2A"
[34] "MTERF1" "MTX2" "MTUS1" "MTSS1L" "MTUS2" "MTNR1B" "MTO1" "MTHFS" "MTCH1" "MTTP" "MTMR6"
[45] "MTF2" "MTFR2" "MTDH" "MTG1" "MTA2" "MTMR12" "MTERF3" "MTMR14" "MTHFD2L" "MTMR10" "MTNR1A"
[56] "MTCL1" "MT1B" "MT1E" "MTSS1" "MTM1" "MTBP" "MTX1" "MTHFR" "MTX3" "MTURN" "MTA1"
[67] "MT1X" "MTF1" "MT1F" "MTOR" "MT1H" "MT1A" "MT1M" "MTCP1" "MTCYBP36" "MTCYBP45" "MTOR-AS1"
[78] "MTCYBP39" "MTCYBP38" "MTCYBP42" "MTRNR2L4" "MTCO2P34" "MTCO1P57" "MTCYBP34" "MTFP1" "MT1HL1" "MTHFD2P1" "MTCYBP44"
[89] "MTCYBP37" "MTCYBP40" "MTRNR2L5" "MTCYBP43" "MTCYBP35" "MTCYBP41" "MTRNR2L8" "MT1JP" "MTRNR2L10" "MTRNR2L3" "MTRNR2L1"
[100] "MTRNR2L7" "MT1L" "MTCYBP33" "MTRNR2L2" "MTRNR2L6" "MTND1P37" "MTND4LP32" "MTND5P42"
Then we can keep !(mitochondrial genes)…
mt <- rowData(rse)$gene_id[grep("^MT-*", unlist(rowData(rse)$symbol, use.names = FALSE))]
and produce the overall read count associated with these genes in each tissue/replicate.
filtered1 <- assays(rse)$counts
filtered1 <- filtered1[which(!(rownames(filtered1) %in% mt)),]
head(filtered1)
SRR1337909 SRR1397094 SRR817686
ENSG00000000003.14 485 500 234
ENSG00000000005.5 8 1 1
ENSG00000000419.12 319 584 511
ENSG00000000457.13 297 332 461
ENSG00000000460.16 138 187 412
ENSG00000000938.12 146 241 837
Total number of reads tissue per column after removal:
c(sum(filtered1[,1]), sum(filtered1[,2]), sum(filtered1[,3]))
[1] 38114947 36483308 37797654
GEO information was absent for the SRP012682 dataset.
colData(rse)[, c("geo_accession", "title", "characteristics")]
DataFrame with 3 rows and 3 columns
geo_accession title characteristics
<logical> <logical> <CharacterList>
SRR1337909 NA NA NA
SRR1397094 NA NA NA
SRR817686 NA NA NA
We can expand the biological metadata information by adding predictions based on RNA-seq data. The predictions include information about sex, sample source (cell line vs tissue), tissue and the sequencing strategy used.
rse <- add_predictions(rse)
2021-07-27 12:31:23 downloading the predictions to /var/folders/4k/0lqzcm0545d30b5nyr_dptyh0000gn/T//RtmpO7I0FM/PredictedPhenotypes_v0.0.06.rda
provo con l'URL 'https://github.com/leekgroup/recount-website/blob/master/predictions/PredictedPhenotypes_v0.0.06.rda?raw=true'
Content type 'application/octet-stream' length 548129 bytes (535 KB)
==================================================
downloaded 535 KB
Loading objects:
PredictedPhenotypes
colData(rse)[, 83:ncol(colData(rse))]
DataFrame with 3 rows and 12 columns
reported_sex predicted_sex accuracy_sex reported_samplesource predicted_samplesource accuracy_samplesource reported_tissue
<factor> <factor> <numeric> <factor> <factor> <numeric> <factor>
SRR1337909 male male 0.998113 tissue tissue 0.970958 Brain
SRR1397094 female female 0.998113 tissue tissue 0.970958 Brain
SRR817686 male male 0.998113 tissue tissue 0.970958 Brain
predicted_tissue accuracy_tissue reported_sequencingstrategy predicted_sequencingstrategy accuracy_sequencingstrategy
<factor> <numeric> <factor> <factor> <numeric>
SRR1337909 Brain 0.965611 PAIRED PAIRED 0.999371
SRR1397094 Brain 0.965611 PAIRED PAIRED 0.999371
SRR817686 Brain 0.965611 PAIRED PAIRED 0.999371
Project SRP012682 has a few extra biologically relevant variables via the SRA Run selector https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP012682.
sra <- read.csv("SraRunTable.txt", header = TRUE)
# Explore it
head(sra)
# Set all column names in lower case
colnames(sra) <- tolower(colnames(sra))
# Choose some meaningful variables we want to add
sra_vars <- c(
"sex", "body_site", "histological_type", "is_tumor", "submitted_subject_id"
)
stopifnot(all(sra_vars %in% colnames(sra)))
# Re-organize the SRA table based on the SRA Run IDs we have
sra1 <- sra[match(colData(rse)$run, sra$run), ]
# Double check the order
stopifnot(identical(colData(rse)$run, as.character(sra1$run)))
# Append the variables of interest
colData(rse) <- cbind(colData(rse), sra1[, sra_vars])
colData(rse)[, 95:ncol(colData(rse))]
DataFrame with 3 rows and 5 columns
sex body_site histological_type is_tumor submitted_subject_id
<character> <character> <character> <character> <character>
SRR1337909 male Brain - Anterior cin.. Brain No GTEX-144GL
SRR1397094 female Brain - Nucleus accu.. Brain No GTEX-13X6K
SRR817686 male Brain - Substantia n.. Brain No GTEX-X4XX
Same steps for the other two tissues.
data <- load("rse_gene_pancreas_0_scaled.Rdata")
rownames(colData(rse)[colnames(rse)[3:5],])
[1] "SRR1374739" "SRR1097883" "SRR2167209"
c(sum(assays(rse)$counts[,3]), sum(assays(rse)$counts[,4]), sum(assays(rse)$counts[,5]))
[1] 37947698 37248113 36301737
rse <- rse[, 3:5] # 3 -> 3 4 5.
rse <- rse[rowData(rse)$bp_length >= 200,]
longerThan200 <- assays(rse)$counts
head(longerThan200)
SRR1374739 SRR1097883 SRR2167209
ENSG00000000003.14 644 555 448
ENSG00000000005.5 1 1 0
ENSG00000000419.12 534 790 1441
ENSG00000000457.13 568 476 491
ENSG00000000460.16 223 219 195
ENSG00000000938.12 631 154 53
c(sum(longerThan200[,1]), sum(longerThan200[,2]), sum(longerThan200[,3]))
[1] 37886305 37184058 36220949
#Check if mitochondrial genes begin with MT.
unlist(rowData(rse)$symbol, use.names = FALSE)[grep("^MT-*", unlist(rowData(rse)$symbol, use.names = FALSE))]
[1] "MTMR7" "MTMR11" "MTA3" "MTMR1" "MTHFD2" "MTFR1" "MTIF2" "MTMR2" "MT3" "MTAP" "MTMR3"
[12] "MTHFD1" "MTG2" "MTMR8" "MT4" "MTHFSD" "MTFMT" "MTMR9" "MTPN" "MTPAP" "MTMR4" "MTCH2"
[23] "MTRF1L" "MTR" "MTFR1L" "MTHFD1L" "MTRF1" "MTERF2" "MTIF3" "MTERF4" "MTRR" "MT1G" "MT2A"
[34] "MTERF1" "MTX2" "MTUS1" "MTSS1L" "MTUS2" "MTNR1B" "MTO1" "MTHFS" "MTCH1" "MTTP" "MTMR6"
[45] "MTF2" "MTFR2" "MTDH" "MTG1" "MTA2" "MTMR12" "MTERF3" "MTMR14" "MTHFD2L" "MTMR10" "MTNR1A"
[56] "MTCL1" "MT1B" "MT1E" "MTSS1" "MTM1" "MTBP" "MTX1" "MTHFR" "MTX3" "MTURN" "MTA1"
[67] "MT1X" "MTF1" "MT1F" "MTOR" "MT1H" "MT1A" "MT1M" "MTCP1" "MTCYBP36" "MTCYBP45" "MTOR-AS1"
[78] "MTCYBP39" "MTCYBP38" "MTCYBP42" "MTRNR2L4" "MTCO2P34" "MTCO1P57" "MTCYBP34" "MTFP1" "MT1HL1" "MTHFD2P1" "MTCYBP44"
[89] "MTCYBP37" "MTCYBP40" "MTRNR2L5" "MTCYBP43" "MTCYBP35" "MTCYBP41" "MTRNR2L8" "MT1JP" "MTRNR2L10" "MTRNR2L3" "MTRNR2L1"
[100] "MTRNR2L7" "MT1L" "MTCYBP33" "MTRNR2L2" "MTRNR2L6" "MTND1P37" "MTND4LP32" "MTND5P42"
mt <- rowData(rse)$gene_id[grep("^MT-*", unlist(rowData(rse)$symbol, use.names = FALSE))]
filtered2 <- assays(rse)$counts
filtered2 <- filtered2[which(!(rownames(filtered2) %in% mt)),]
head(filtered2)
SRR1374739 SRR1097883 SRR2167209
ENSG00000000003.14 644 555 448
ENSG00000000005.5 1 1 0
ENSG00000000419.12 534 790 1441
ENSG00000000457.13 568 476 491
ENSG00000000460.16 223 219 195
ENSG00000000938.12 631 154 53
c(sum(filtered2[,1]), sum(filtered2[,2]), sum(filtered2[,3]))
[1] 37833065 37132317 36174592
Columns contain a lot of information, that permit to look up each sample in the GTEx database for further details, e.g. age/sex of the donor, a picture of the tissue sample itself, and so on (https://gtexportal.org/home/histologyPage). Unfortunately we could not appreciate those information before because “sampid”s of our three brain replicates are not present in the GTEx database.
colData(rse)$sampid
[1] "GTEX-139YR-1526-SM-5IFJ1" "GTEX-XBED-0226-SM-47JY8" "GTEX-XXEK-1726-SM-5S2SR"
All samples were collected from males between 50 and 69 yo all died from “ventilator case” (Hardy scale) and only the first had a fibrosis.
colData(rse)$smpthnts
[1] "2 pieces; islets well visualized, focal PanIN-1, some fibrosis"
[2] "2 pieces, 12.5x10 & 14x13mm; Scattered eosinophilic foci in exocrine and endocrine tissue, some having infarcted centers and fibrin thrombi in capillaries."
[3] "2 pieces, 9x8 & 10x9mm; extremely well preserved, minimal fat, Islets well visualized; representative ones delineated"
rse <- add_predictions(rse)
2021-07-27 12:31:26 downloading the predictions to /var/folders/4k/0lqzcm0545d30b5nyr_dptyh0000gn/T//RtmpO7I0FM/PredictedPhenotypes_v0.0.06.rda
provo con l'URL 'https://github.com/leekgroup/recount-website/blob/master/predictions/PredictedPhenotypes_v0.0.06.rda?raw=true'
Content type 'application/octet-stream' length 548129 bytes (535 KB)
==================================================
downloaded 535 KB
Loading objects:
PredictedPhenotypes
colData(rse)[, 83:ncol(colData(rse))]
DataFrame with 3 rows and 12 columns
reported_sex predicted_sex accuracy_sex reported_samplesource predicted_samplesource accuracy_samplesource reported_tissue
<factor> <factor> <numeric> <factor> <factor> <numeric> <factor>
SRR1374739 male male 0.998113 tissue tissue 0.970958 Pancreas
SRR1097883 male male 0.998113 tissue tissue 0.970958 Pancreas
SRR2167209 male male 0.998113 tissue tissue 0.970958 Pancreas
predicted_tissue accuracy_tissue reported_sequencingstrategy predicted_sequencingstrategy accuracy_sequencingstrategy
<factor> <numeric> <factor> <factor> <numeric>
SRR1374739 Pancreas 0.965611 PAIRED PAIRED 0.999371
SRR1097883 Pancreas 0.965611 PAIRED PAIRED 0.999371
SRR2167209 Pancreas 0.965611 PAIRED PAIRED 0.999371
sra2 <- sra[match(colData(rse)$run, sra$run), ]
# Double check the order
stopifnot(identical(colData(rse)$run, as.character(sra2$run)))
# Append the variables of interest
colData(rse) <- cbind(colData(rse), sra2[, sra_vars])
colData(rse)[, 95:ncol(colData(rse))]
DataFrame with 3 rows and 5 columns
sex body_site histological_type is_tumor submitted_subject_id
<character> <character> <character> <character> <character>
SRR1374739 male Pancreas Pancreas No GTEX-139YR
SRR1097883 male Pancreas Pancreas No GTEX-XBED
SRR2167209 male Pancreas Pancreas No GTEX-XXEK
data <- load("rse_gene_liver_8_scaled.Rdata")
rownames(colData(rse)[colnames(rse)[8:10],])
[1] "SRR1349479" "SRR1431248" "SRR1405054"
c(sum(assays(rse)$counts[,8]), sum(assays(rse)$counts[,9]), sum(assays(rse)$counts[,10]))
[1] 40990128 37822190 37364870
rse <- rse[, 8:10] # 8 -> 8 9 10.
rse <- rse[rowData(rse)$bp_length >= 200,]
longerThan200 <- assays(rse)$counts
head(longerThan200)
SRR1349479 SRR1431248 SRR1405054
ENSG00000000003.14 2258 1629 2190
ENSG00000000005.5 2 1 2
ENSG00000000419.12 974 657 546
ENSG00000000457.13 447 800 785
ENSG00000000460.16 343 439 323
ENSG00000000938.12 1252 591 259
c(sum(longerThan200[,1]), sum(longerThan200[,2]), sum(longerThan200[,3]))
[1] 40933056 37751464 37318324
#Check if mitochondrial genes begin with MT.
unlist(rowData(rse)$symbol, use.names = FALSE)[grep("^MT-*", unlist(rowData(rse)$symbol, use.names = FALSE))]
[1] "MTMR7" "MTMR11" "MTA3" "MTMR1" "MTHFD2" "MTFR1" "MTIF2" "MTMR2" "MT3" "MTAP" "MTMR3"
[12] "MTHFD1" "MTG2" "MTMR8" "MT4" "MTHFSD" "MTFMT" "MTMR9" "MTPN" "MTPAP" "MTMR4" "MTCH2"
[23] "MTRF1L" "MTR" "MTFR1L" "MTHFD1L" "MTRF1" "MTERF2" "MTIF3" "MTERF4" "MTRR" "MT1G" "MT2A"
[34] "MTERF1" "MTX2" "MTUS1" "MTSS1L" "MTUS2" "MTNR1B" "MTO1" "MTHFS" "MTCH1" "MTTP" "MTMR6"
[45] "MTF2" "MTFR2" "MTDH" "MTG1" "MTA2" "MTMR12" "MTERF3" "MTMR14" "MTHFD2L" "MTMR10" "MTNR1A"
[56] "MTCL1" "MT1B" "MT1E" "MTSS1" "MTM1" "MTBP" "MTX1" "MTHFR" "MTX3" "MTURN" "MTA1"
[67] "MT1X" "MTF1" "MT1F" "MTOR" "MT1H" "MT1A" "MT1M" "MTCP1" "MTCYBP36" "MTCYBP45" "MTOR-AS1"
[78] "MTCYBP39" "MTCYBP38" "MTCYBP42" "MTRNR2L4" "MTCO2P34" "MTCO1P57" "MTCYBP34" "MTFP1" "MT1HL1" "MTHFD2P1" "MTCYBP44"
[89] "MTCYBP37" "MTCYBP40" "MTRNR2L5" "MTCYBP43" "MTCYBP35" "MTCYBP41" "MTRNR2L8" "MT1JP" "MTRNR2L10" "MTRNR2L3" "MTRNR2L1"
[100] "MTRNR2L7" "MT1L" "MTCYBP33" "MTRNR2L2" "MTRNR2L6" "MTND1P37" "MTND4LP32" "MTND5P42"
mt <- rowData(rse)$gene_id[grep("^MT-*", unlist(rowData(rse)$symbol, use.names = FALSE))]
filtered3 <- assays(rse)$counts
filtered3 <- filtered3[which(!(rownames(filtered3) %in% mt)),]
head(filtered3)
SRR1349479 SRR1431248 SRR1405054
ENSG00000000003.14 2258 1629 2190
ENSG00000000005.5 2 1 2
ENSG00000000419.12 974 657 546
ENSG00000000457.13 447 800 785
ENSG00000000460.16 343 439 323
ENSG00000000938.12 1252 591 259
c(sum(filtered3[,1]), sum(filtered3[,2]), sum(filtered3[,3]))
[1] 40849046 37656652 37204631
colData(rse)$sampid
[1] "GTEX-YB5E-0326-SM-5IFHU" "GTEX-1269C-0626-SM-5FQSS" "GTEX-14AS3-0126-SM-5Q5F4"
The first replicate was collected from a man (40-49 yo) and it’s associated with “necrosis”, man died from ventilator case. Second one comes from a woman (60-69 yo) died from natural causes and the tissue shows congestion and steatosis. Last one comes from a woman died from ventilator case (40-49 yo) and the tissue shows no pathology.
colData(rse)$smpthnts
[1] "2 pieces, 12x10 & 11x11mm; marked central and mid-zonal coagulative necrosis necrosis, shock vs toxic damage by drugs"
[2] "2 pieces, mild macrovesicular steatosis, passive congestion"
[3] "2 pieces"
rse <- add_predictions(rse)
2021-07-27 12:31:28 downloading the predictions to /var/folders/4k/0lqzcm0545d30b5nyr_dptyh0000gn/T//RtmpO7I0FM/PredictedPhenotypes_v0.0.06.rda
provo con l'URL 'https://github.com/leekgroup/recount-website/blob/master/predictions/PredictedPhenotypes_v0.0.06.rda?raw=true'
Content type 'application/octet-stream' length 548129 bytes (535 KB)
==================================================
downloaded 535 KB
Loading objects:
PredictedPhenotypes
colData(rse)[, 83:ncol(colData(rse))]
DataFrame with 3 rows and 12 columns
reported_sex predicted_sex accuracy_sex reported_samplesource predicted_samplesource accuracy_samplesource reported_tissue
<factor> <factor> <numeric> <factor> <factor> <numeric> <factor>
SRR1349479 male male 0.998113 tissue tissue 0.970958 Liver
SRR1431248 female female 0.998113 tissue tissue 0.970958 Liver
SRR1405054 female female 0.998113 tissue tissue 0.970958 Liver
predicted_tissue accuracy_tissue reported_sequencingstrategy predicted_sequencingstrategy accuracy_sequencingstrategy
<factor> <numeric> <factor> <factor> <numeric>
SRR1349479 Liver 0.965611 PAIRED PAIRED 0.999371
SRR1431248 Liver 0.965611 PAIRED PAIRED 0.999371
SRR1405054 Liver 0.965611 PAIRED PAIRED 0.999371
# Re-organize the SRA table based on the SRA Run IDs we have
sra3 <- sra[match(colData(rse)$run, sra$run), ]
# Double check the order
stopifnot(identical(colData(rse)$run, as.character(sra3$run)))
# Append the variables of interest
colData(rse) <- cbind(colData(rse), sra3[, sra_vars])
colData(rse)[, 95:ncol(colData(rse))]
DataFrame with 3 rows and 5 columns
sex body_site histological_type is_tumor submitted_subject_id
<character> <character> <character> <character> <character>
SRR1349479 male Liver Liver No GTEX-YB5E
SRR1431248 female Liver Liver No GTEX-1269C
SRR1405054 female Liver Liver No GTEX-14AS3
Merge them into a single count table/object, for subsequent analyses.
countTable <- merge(filtered1, filtered2, by = "row.names", all = TRUE)
rownames(countTable) <- countTable$Row.names
countTable$Row.names <- NULL
countTable <- data.matrix(countTable, rownames.force = NA)
countTable <- merge(countTable, filtered3, by = "row.names", all = TRUE)
rownames(countTable) <- countTable$Row.names
countTable$Row.names <- NULL
countTable <- data.matrix(countTable, rownames.force = NA)
head(countTable)
SRR1337909 SRR1397094 SRR817686 SRR1374739 SRR1097883 SRR2167209 SRR1349479 SRR1431248 SRR1405054
ENSG00000000003.14 485 500 234 644 555 448 2258 1629 2190
ENSG00000000005.5 8 1 1 1 1 0 2 1 2
ENSG00000000419.12 319 584 511 534 790 1441 974 657 546
ENSG00000000457.13 297 332 461 568 476 491 447 800 785
ENSG00000000460.16 138 187 412 223 219 195 343 439 323
ENSG00000000938.12 146 241 837 631 154 53 1252 591 259
Let’s prepare a table summarizing what we’ve done, one row per tissue/sample.
a <- c("Brain / SRR1337909", "Brain / SRR1397094", "Brain / SRR817686", "Pancreas / SRR1374739", "Pancreas / SRR1097883", "Pancreas / SRR2167209", "Liver / SRR1349479", "Liver / SRR1431248", "Liver / SRR1405054")
b <- c(38280901, 36625433, 37946538, 37947698, 37248113, 36301737, 40990128, 37822190, 37364870)
c <- c(106875, 69732, 58186, 61393, 64055, 80788, 57072, 70726, 46546)
d <- c(0.279186218, 0.190392288, 0.153336781, 0.161783199, 0.171968443, 0.222545824, 0.139233524, 0.186996046, 0.124571556)
e <- c(59079, 72393, 90698, 53240, 51741, 46357, 84010, 94812, 113693)
f <- c(0.154330223, 0.197657731, 0.239015216, 0.140298365, 0.13890905, 0.127699123, 0.204951787, 0.250678239, 0.304277788)
g <- c(sum(countTable[,1]), sum(countTable[,2]), sum(countTable[,3]), sum(countTable[,4]), sum(countTable[,5]), sum(countTable[,6]), sum(countTable[,7]), sum(countTable[,8]), sum(countTable[,9]))
tissuePerColumn <- data.frame(TissueColumn = a, TotalNumberOfReads = b, TotalNumberOfReadsOnShortRNAs = c, PercentageOfReadsOnShortRNAs = d, TotalNumberOfReadsOnMTGenes = e, PercentageOfReadsOnMTgenes = f, TotalNumberOfFilteredReads = g)
rownames(tissuePerColumn) <- tissuePerColumn$TissueColumn
tissuePerColumn$TissueColumn <- NULL
tissuePerColumn
The tool for calling DE genes we choose is edgeR.
library("edgeR")
We’ll process all tissues together, with a suitable design matrix. The row names of the table are the “annoying” ENSG IDs with the number at the end, so we clean them.
ensembl <- gsub("\\..*", "", rownames(countTable))
rownames(countTable) <- ensembl
head(countTable)
SRR1337909 SRR1397094 SRR817686 SRR1374739 SRR1097883 SRR2167209 SRR1349479 SRR1431248 SRR1405054
ENSG00000000003 485 500 234 644 555 448 2258 1629 2190
ENSG00000000005 8 1 1 1 1 0 2 1 2
ENSG00000000419 319 584 511 534 790 1441 974 657 546
ENSG00000000457 297 332 461 568 476 491 447 800 785
ENSG00000000460 138 187 412 223 219 195 343 439 323
ENSG00000000938 146 241 837 631 154 53 1252 591 259
We know that the first three columns are brain cells, the subsequent three are pancreas cells, while the remaining three are liver cells. We give to edgeR the corresponding info.
group <- as.factor(rep(c("Brain","Pancreas","Liver"), c(3,3,3)))
For edgeR the “DE gene” object containing all the info about the dataset as well as the parameters estimated during the different steps of the analysis is a “DGEList”; we add the counts to it.
#countTable[is.na(countTable)] = 0
y <- DGEList(counts=countTable)
y$samples$group <- group
y
An object of class "DGEList"
$counts
SRR1337909 SRR1397094 SRR817686 SRR1374739 SRR1097883 SRR2167209 SRR1349479 SRR1431248 SRR1405054
ENSG00000000003 485 500 234 644 555 448 2258 1629 2190
ENSG00000000005 8 1 1 1 1 0 2 1 2
ENSG00000000419 319 584 511 534 790 1441 974 657 546
ENSG00000000457 297 332 461 568 476 491 447 800 785
ENSG00000000460 138 187 412 223 219 195 343 439 323
50587 more rows ...
$samples
NA
ensembl <- gsub(".*R","",rownames(y$samples))
barplot(y$samples$lib.size*1e-6, names=ensembl, ylab="Library size (millions)", las=2, col=rep(2:4, each=3))
legend("top", legend = c("Brain", "Pancreas", "Liver"), fill=rep(2:4, each=1), bty = "n", inset = c(-0.05,-0.1), xpd=TRUE, horiz=T)
Notice that “norm.factors” is still set to one. First, it is advisable to remove altogether all genes with low or zero counts.
table(rowSums(y$counts==0)==9)
FALSE TRUE
42550 8042
keep.exprs <- filterByExpr(y, group=group)
y <- y[keep.exprs,, keep.lib.sizes=FALSE]
Function keep.exprs permits also to change the parameters of filtering. Let us extract and store in a vector the log of the counts per million before normalization with the “cpm” function.
logcpm_before <- cpm(y, log=TRUE)
We can now normalize the counts. TMM (trimmed mean of M-values) is recommended for most RNA-Seq data where the majority (more than half) of the genes are believed not differentially expressed between any pair of the samples.
y <- calcNormFactors(y, method = "TMM")
y
An object of class "DGEList"
$counts
SRR1337909 SRR1397094 SRR817686 SRR1374739 SRR1097883 SRR2167209 SRR1349479 SRR1431248 SRR1405054
ENSG00000000003 485 500 234 644 555 448 2258 1629 2190
ENSG00000000419 319 584 511 534 790 1441 974 657 546
ENSG00000000457 297 332 461 568 476 491 447 800 785
ENSG00000000460 138 187 412 223 219 195 343 439 323
ENSG00000000938 146 241 837 631 154 53 1252 591 259
24257 more rows ...
$samples
NA
Notice now “norm.factors” has changed, and despite quite relevant differences in library sizes, they remain quite moderate. Distribution of (normalized) log-cpm values across samples:
logcpm <- cpm(y, log=TRUE)
boxplot(logcpm, las=2, names = ensembl, col=rep(2:4, each=3), notch = TRUE)
legend("top", legend = c("Brain", "Pancreas", "Liver"), fill=rep(2:4, each=1), bty = "n", inset = c(-0.05,-0.1), xpd=TRUE, horiz=T)
Boxplots of the log2(normalized_counts) of each sample, one boxplot per sample without outliers.
boxplot(logcpm, las=2, names = ensembl, col=rep(2:4, each=3), outline=FALSE , notch = TRUE)
legend("top", legend = c("Brain", "Pancreas", "Liver"), fill=rep(2:4, each=1), bty = "n", inset = c(-0.05,-0.1), xpd=TRUE, horiz=T)
And before normalization:
boxplot(logcpm_before, las=2, names = ensembl, col=rep(2:4, each=3), notch = TRUE)
legend("top", legend = c("Brain", "Pancreas", "Liver"), fill=rep(2:4, each=1), bty = "n", inset = c(-0.05,-0.1), xpd=TRUE, horiz=T)
The change is small, but it can be noticed by eye. Now we design the linear model; we don’t have to define an “intercept” term for our model. The intercept is on the origin since there is no “base condition” (e.g. cancer cells vs “normal” cells) from which the others can be related by a change.
design <- model.matrix(~0+group, data=y$samples)
colnames(design) <- levels(y$samples$group)
design
Brain Liver Pancreas
SRR1337909 1 0 0
SRR1397094 1 0 0
SRR817686 1 0 0
SRR1374739 0 0 1
SRR1097883 0 0 1
SRR2167209 0 0 1
SRR1349479 0 1 0
SRR1431248 0 1 0
SRR1405054 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
At this point, counts have been normalized and the design defined.
Exploratory analysis: we plot the samples labeled by group and then by replicate.
plotMDS(logcpm, labels=group, col=rep(2:4, each=3))
plotMDS(y, col=rep(2:4, each=3))
MDS is a dimensionality reduction, in which instead of expression the fold ratio values among the three samples are employed. We now estimate the NB dispersion, and plot the BCV.
library(statmod)
y <- estimateDisp(y, design, robust=TRUE)
plotBCV(y, main="BCV Plot")
All the parameters have been stored in the y object, among which the “common” and gene-specific dispersion estimates.
y$common.dispersion
[1] 0.3094967
Now we fit the data to the “generalized linear” model.
fit <- glmQLFit(y, design, robust=TRUE)
plotQLDisp(fit)
For testing for DE genes, we have to specify the contrast. The following is condition 1 (brain) vs condition 2 (pancreas). topTags returns the genes with the highest variation.
qlf.1vs2 <- glmQLFTest(fit, contrast=c(1,0,-1)) # keep in mind grouping (see design matrix) follows alphabetical order, therefore 1 -> Brain 0 -> Liver -1 -> Pancreas
topTags(qlf.1vs2)
Coefficient: 1*Brain -1*Pancreas
Plot log-fold change against log-counts per million, with DE genes highlighted:
plotMD(qlf.1vs2)
abline(h=c(1, 0, -1), col="blue")
The complete results of the test are in qlf.1vs2$table. Let us select the ones with p-value (FDR) < 0.05.
FDR <- p.adjust(qlf.1vs2$table$PValue, method="BH")
sum(FDR < 0.05)
[1] 7507
Or:
summary(decideTests(qlf.1vs2))
1*Brain -1*Pancreas
Down 3084
NotSig 16755
Up 4423
decideTests has a default FDR (BH adjusted p-value) threshold of 0.05, and no check on the log-fold ratio. We can make the selection more stringent, by setting stricter thresholds for both and setting lfc (log-fold-change) = 1/-1.
summary(decideTests(qlf.1vs2, p.value=0.01, lfc=1))
1*Brain -1*Pancreas
Down 1536
NotSig 20502
Up 2224
We retain p-value = 0.01 is a better and more significant value since most of the tools have a threshold around 2,000 for the maximum number of genes that can be submitted for any analysis of this kind.
Here we save the list of genes with adjusted p-value (FDR) lower than 0.01.
deg.1vs2 <- topTags(qlf.1vs2, n=Inf, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(deg.1vs2)
We can perform the comparision of condition 1 vs. 3 in a similar way.
qlf.1vs3 <- glmQLFTest(fit, contrast=c(1,-1,0))
deg.1vs3 <- topTags(qlf.1vs3, n=Inf, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(deg.1vs3)
But also contrast condition 2 vs. 3.
qlf.2vs3 <- glmQLFTest(fit, contrast=c(0,-1,1))
deg.2vs3 <- topTags(qlf.2vs3, n=Inf, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(deg.2vs3)
Now we have all our results, contained here.
head(qlf.1vs2$table)
head(qlf.1vs3$table)
head(qlf.2vs3$table)
Now we want to replace the ENSG IDs with the gene name/symbol, but instead of replacing the names in the table it is better to add an additional column to it. The org.Hs.eg.db is a R package containing the gene annotation for human (same for mouse if you replace Hs with Mm, and so on) in which for each gene in each annotation there is its correspondence to another annotation. We can employ it to “translate” gene IDs or names from one annotation to the other. We also add an “entrezid” column, useful for subsequent analysis since GO and KEGG (the tools we’ll use) work on that one.
library("org.Hs.eg.db")
qlf.1vs2$table$symbol <- mapIds(org.Hs.eg.db,keys=rownames(qlf.1vs2$table), keytype="ENSEMBL", column="SYMBOL", multiVals="first")
'select()' returned 1:many mapping between keys and columns
qlf.1vs2$table$entrezid <- mapIds(org.Hs.eg.db, keys=rownames(qlf.1vs2$table), column="ENTREZID", keytype="ENSEMBL", multiVals="first")
'select()' returned 1:many mapping between keys and columns
qlf.1vs3$table$symbol <- mapIds(org.Hs.eg.db,keys=rownames(qlf.1vs3$table), keytype="ENSEMBL", column="SYMBOL", multiVals="first")
'select()' returned 1:many mapping between keys and columns
qlf.1vs3$table$entrezid <- mapIds(org.Hs.eg.db, keys=rownames(qlf.1vs3$table), column="ENTREZID", keytype="ENSEMBL", multiVals="first")
'select()' returned 1:many mapping between keys and columns
qlf.2vs3$table$symbol <- mapIds(org.Hs.eg.db,keys=rownames(qlf.2vs3$table), keytype="ENSEMBL", column="SYMBOL", multiVals="first")
'select()' returned 1:many mapping between keys and columns
qlf.2vs3$table$entrezid <- mapIds(org.Hs.eg.db, keys=rownames(qlf.2vs3$table), column="ENTREZID", keytype="ENSEMBL", multiVals="first")
'select()' returned 1:many mapping between keys and columns
head(qlf.1vs2$table)
head(qlf.1vs3$table)
head(qlf.2vs3$table)
Now we can select what we consider to be the “DE” genes from the table. We select them by corrected p-value first, then by FC (positive/negative).
deg.1vs2 <- topTags(qlf.1vs2, n=Inf, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
deg.1vs3 <- topTags(qlf.1vs3, n=Inf, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
deg.2vs3 <- topTags(qlf.2vs3, n=Inf, adjust.method = "BH", sort.by = "PValue", p.value = 0.01)$table
head(deg.1vs2)
head(deg.1vs3)
head(deg.2vs3)
And then retrieve the lists of “up” and “down” regulated ones.
up.genes.1vs2 <- deg.1vs2[deg.1vs2$logFC > 0,]
up.genes.1vs3 <- deg.1vs3[deg.1vs3$logFC > 0,]
up.genes.2vs3 <- deg.2vs3[deg.2vs3$logFC > 0,]
head(up.genes.1vs2)
head(up.genes.1vs3)
head(up.genes.2vs3)
down.genes.1vs2 <- deg.1vs2[deg.1vs2$logFC < 0,]
down.genes.1vs3 <- deg.1vs3[deg.1vs3$logFC < 0,]
down.genes.2vs3 <- deg.2vs3[deg.2vs3$logFC < 0,]
head(down.genes.1vs2)
head(down.genes.1vs3)
head(down.genes.2vs3)
Finally, we can save Excel sheets for further use.
library("xlsx")
write.xlsx(deg.1vs2, file="deg.1vs2.xlsx", sheetName = "a", col.names = TRUE, row.names = TRUE, append = FALSE)
write.xlsx(deg.1vs3, file="deg.1vs3.xlsx", sheetName = "b", col.names = TRUE, row.names = TRUE, append = FALSE)
write.xlsx(deg.2vs3, file="deg.2vs3.xlsx", sheetName = "c", col.names = TRUE, row.names = TRUE, append = FALSE)
#up 1vs23
l <- list(up.genes.1vs2, up.genes.1vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
library(dplyr)
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
# l is a list containing two dataframes with same rownames, symbols & entrezids, as expected. You can check it using l1 <- l[[1]]; l1 <- l1[ order(row.names(l1)), ]; l2 <- l[[2]]; l2 <- l2[ order(row.names(l2)), ]; l1 == l2
# Therefore we can pick either l[[1]] or l[[2]]
write.xlsx(l[[1]], file="up.1vs23.xlsx", sheetName = "d-1vs23", col.names = TRUE, row.names = TRUE, append = FALSE)
#up 2vs13
l <- list(up.genes.1vs2, up.genes.2vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
write.xlsx(l[[1]], file="up.2vs13.xlsx", sheetName = "d-2vs13", col.names = TRUE, row.names = TRUE, append = FALSE)
#up 3vs12
l <- list(up.genes.1vs3, up.genes.2vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
write.xlsx(l[[1]], file="up.3vs12.xlsx", sheetName = "d-3vs12", col.names = TRUE, row.names = TRUE, append = FALSE)
#down 1vs23
l <- list(down.genes.1vs2, down.genes.1vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
write.xlsx(l[[1]], file="down.1vs23.xlsx", sheetName = "e-1vs23", col.names = TRUE, row.names = TRUE, append = FALSE)
#down 2vs13
l <- list(down.genes.1vs2, down.genes.2vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
write.xlsx(l[[1]], file="down.2vs13.xlsx", sheetName = "e-2vs13", col.names = TRUE, row.names = TRUE, append = FALSE)
#down 3vs12
l <- list(down.genes.1vs3, down.genes.2vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
write.xlsx(l[[1]], file="down.3vs12.xlsx", sheetName = "e-3vs12", col.names = TRUE, row.names = TRUE, append = FALSE)
Gene up-regulated in Tissue 1 vs Tissue 2 with the lowest FDR and up-regulated also in Tissue 1 vs Tissue 3.
l <- list(head(up.genes.1vs2), up.genes.1vs3)
common_names = Reduce(intersect, lapply(l, row.names))
l <- lapply(l, function(x) { x[row.names(x) %in% common_names,] })
l <- lapply(l, function(x) x%>% select(symbol,entrezid))
l[[1]][1,]
The gene ontology (GO) and the KEGG pathway analysis are the common downstream procedures to interpret the differential expression results in a biological context. Given a set of genes that are up- or down-regulated under a certain contrast of interest, a GO (or pathway) analysis will find which GO terms (or pathways) are over- or under-represented using annotations for the genes in that set. Suppose we want to identify GO terms and KEGG pathways in group 1 (brain) compared to group 2 (pancreas) from the previous analysis.
Focusing on the ontology of Biological Process (BP) and metabolic pathways (given by KEGG) we can observe up-regulated genes that in the tumors (or more in general in diseases, since we have no cancerous cells condition) tend to be associated with cell differentiation, cell migration and tissue morphogenesis.
go <- goana(list(Up=up.genes.1vs2$entrezid, Down=down.genes.1vs2$entrezid), species="Hs", FDR=0.01)
topGO(go, ontology = "BP", number = 30L, truncate.term = NULL, sort="Up")
keg <- kegga(list(Up=up.genes.1vs2$entrezid, Down=down.genes.1vs2$entrezid), species="Hs", FDR=0.01)
topKEGG(keg, number = 30L, truncate.path = NULL, sort="Up")
library(clusterProfiler)
ego <- enrichGO(up.genes.1vs2$entrezid, OrgDb="org.Hs.eg.db", ont="BP", readable=TRUE, pvalueCutoff=0.01)
library(enrichplot)
dotplot(ego, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
kk <- enrichKEGG(up.genes.1vs2$entrezid, organism = "hsa", pvalueCutoff=0.01)
dotplot(kk, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
These bunch of genes up-regulated showed above are all associated with the nervous system (probably they are over-expressed in older ages and also in cerebrovascular or neurological diseases, principal causes of death of these donors).
Down-regulated 1vs2 genes are mostly associated with pancreatic functions. We can also appreciate the presence of “Maturity onset diabetes of the young” (KEGG) which refers to any of several hereditary forms of diabetes mellitus caused by mutations in an autosomal dominant gene disrupting insulin production in islets of Langerhans of pancreas.
topGO(go, ontology = "BP", number = 30L, truncate.term = NULL, sort="Down")
topKEGG(keg, number = 30L, truncate.path = NULL, sort="Down")
ego <- enrichGO(down.genes.1vs2$entrezid, OrgDb="org.Hs.eg.db", ont="BP", readable=TRUE, pvalueCutoff=0.01)
dotplot(ego, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
kk <- enrichKEGG(down.genes.1vs2$entrezid, organism = "hsa", pvalueCutoff=0.01)
dotplot(kk, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
We can do the same for group 1 (brain) compared to group 3 (liver).
go <- goana(list(Up=up.genes.1vs3$entrezid, Down=down.genes.1vs3$entrezid), species="Hs", FDR=0.01)
topGO(go, ontology = "BP", number = 30L, truncate.term = NULL, sort = "Up")
keg <- kegga(list(Up=up.genes.1vs3$entrezid, Down=down.genes.1vs3$entrezid), species="Hs", FDR=0.01)
topKEGG(keg, number = 30L, truncate.path = NULL, sort="Down")
ego <- enrichGO(up.genes.1vs3$entrezid, OrgDb="org.Hs.eg.db", ont="BP", readable=TRUE, pvalueCutoff=0.01)
dotplot(ego, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
kk <- enrichKEGG(up.genes.1vs3$entrezid, organism = "hsa", pvalueCutoff=0.01)
dotplot(kk, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
We can see similar results, up-regulated genes are associated with neural functions. Let’s go for down-regulated genes.
topGO(go, ontology = "BP", number = 30L, truncate.term = NULL, sort="Down")
topKEGG(keg, number = 30L, truncate.path = NULL,sort="Down")
ego <- enrichGO(down.genes.1vs3$entrezid, OrgDb="org.Hs.eg.db", ont="BP", readable=TRUE, pvalueCutoff=0.01)
dotplot(ego, showCategory=15)
wrong orderBy parameter; set to default `orderBy = "x"`
kk <- enrichKEGG(down.genes.1vs3$entrezid, organism = "hsa", pvalueCutoff=0.01)
dotplot(kk, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
Reaching similar results of 1vs2. Down-regulated genes are associated with hepatic (instead of pancreatic) functions, such as metabolic ones, bile secretion, hormone biosynthesis and so on and so forth.
Finally, group 2 (pancreas) compared to group 3 (liver).
go <- goana(list(Up=up.genes.2vs3$entrezid, Down=down.genes.2vs3$entrezid), species="Hs", FDR=0.01)
topGO(go, ontology = "BP", number = 30L, truncate.term = NULL, sort = "Up")
keg <- kegga(list(Up=up.genes.2vs3$entrezid, Down=down.genes.2vs3$entrezid), species="Hs", FDR=0.01)
topKEGG(keg, number = 30L, truncate.path = NULL, sort = "Up")
ego <- enrichGO(up.genes.2vs3$entrezid, OrgDb="org.Hs.eg.db", ont="BP", readable=TRUE, pvalueCutoff=0.01)
dotplot(ego, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
kk <- enrichKEGG(up.genes.2vs3$entrezid, organism = "hsa", pvalueCutoff=0.01)
dotplot(kk, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
Googling information on those bunch of genes (https://www.ebi.ac.uk/QuickGO/) we see that most of these up-regulated genes (and their final products) are associated with metabolic processing taking place in both pancreatic and hepatic cells (we can see both “Pancreatic secretion” and “Gastric acid secretion” for example).
Let’s see down-regulated genes.
topGO(go, ontology = "BP", number = 30L, truncate.term = NULL, sort="Down")
topKEGG(keg, number = 30L, truncate.path = NULL, sort="Down")
ego <- enrichGO(down.genes.2vs3$entrezid, OrgDb="org.Hs.eg.db", ont="BP", readable=TRUE, pvalueCutoff=0.01)
dotplot(ego, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
kk <- enrichKEGG(down.genes.2vs3$entrezid, organism = "hsa", pvalueCutoff=0.01)
dotplot(kk, showCategory=30)
wrong orderBy parameter; set to default `orderBy = "x"`
We obtained a similar list of down-regulated genes of 1vs3: down-regulated genes are associated with hepatic functions.
The aim of the post-processing bulk RNA-Seq dataset given by this last analysis is achieved. We used two tools (and generated respective clusterProfiler dot plots) in order to determine whether the results “make sense”; the GO annotations & KEGG pathways are consistent with the fact that the genes are up-regulated or down-regulated in our three tissues. In other words, if the samples were “anonymous”, we’d have been able to discover from which tissue each sample was taken from.
This analysis was conducted on:
sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] enrichplot_1.10.2 clusterProfiler_3.18.1 dplyr_1.0.7 xlsx_0.6.5 org.Hs.eg.db_3.12.0
[6] AnnotationDbi_1.52.0 statmod_1.4.36 recount_1.16.1 SummarizedExperiment_1.20.0 Biobase_2.50.0
[11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.1
[16] MatrixGenerics_1.2.1 matrixStats_0.59.0 edgeR_3.32.1 limma_3.46.0
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 rtracklayer_1.50.0 scattermore_0.7 SeuratObject_4.0.2 tidyr_1.1.3
[6] bumphunter_1.32.0 ggplot2_3.3.5 bit64_4.0.5 knitr_1.33 irlba_2.3.3
[11] DelayedArray_0.16.3 data.table_1.14.0 rpart_4.1-15 RCurl_1.98-1.3 GEOquery_2.58.0
[16] derfinder_1.24.2 generics_0.1.0 GenomicFeatures_1.42.3 cowplot_1.1.1 RSQLite_2.2.7
[21] shadowtext_0.0.8 RANN_2.6.1 future_1.21.0 bit_4.0.4 spatstat.data_2.1-0
[26] xml2_1.3.2 httpuv_1.6.1 assertthat_0.2.1 viridis_0.6.1 xfun_0.24
[31] hms_1.1.0 rJava_1.0-4 evaluate_0.14 promises_1.2.0.1 fansi_0.5.0
[36] progress_1.2.2 dbplyr_2.1.1 igraph_1.2.6 DBI_1.1.1 htmlwidgets_1.5.3
[41] spatstat.geom_2.2-0 purrr_0.3.4 ellipsis_0.3.2 backports_1.2.1 biomaRt_2.46.3
[46] deldir_0.2-10 vctrs_0.3.8 ROCR_1.0-11 abind_1.4-5 cachem_1.0.5
[51] ggforce_0.3.3 BSgenome_1.58.0 checkmate_2.0.0 sctransform_0.3.2 GenomicAlignments_1.26.0
[56] prettyunits_1.1.1 goftest_1.2-2 cluster_2.1.2 DOSE_3.16.0 lazyeval_0.2.2
[61] crayon_1.4.1 labeling_0.4.2 pkgconfig_2.0.3 tweenr_1.0.2 nlme_3.1-152
[66] nnet_7.3-16 rlang_0.4.11 globals_0.14.0 lifecycle_1.0.0 miniUI_0.1.1.1
[71] downloader_0.4 BiocFileCache_1.14.0 polyclip_1.10-0 lmtest_0.9-38 rngtools_1.5
[76] Matrix_1.3-4 zoo_1.8-9 base64enc_0.1-3 ggridges_0.5.3 png_0.1-7
[81] viridisLite_0.4.0 bitops_1.0-7 KernSmooth_2.23-20 Biostrings_2.58.0 blob_1.2.1
[86] doRNG_1.8.2 stringr_1.4.0 qvalue_2.22.0 parallelly_1.26.1 readr_1.4.0
[91] jpeg_0.1-8.1 scales_1.1.1 memoise_2.0.0 magrittr_2.0.1 plyr_1.8.6
[96] ica_1.0-2 derfinderHelper_1.24.1 zlibbioc_1.36.0 compiler_4.0.4 scatterpie_0.1.6
[101] RColorBrewer_1.1-2 fitdistrplus_1.1-5 Rsamtools_2.6.0 cli_3.0.0 XVector_0.30.0
[106] listenv_0.8.0 patchwork_1.1.1 pbapply_1.4-3 htmlTable_2.2.1 Formula_1.2-4
[111] MASS_7.3-54 mgcv_1.8-36 tidyselect_1.1.1 stringi_1.6.2 yaml_2.2.1
[116] GOSemSim_2.16.1 askpass_1.1 locfit_1.5-9.4 latticeExtra_0.6-29 ggrepel_0.9.1
[121] grid_4.0.4 VariantAnnotation_1.36.0 fastmatch_1.1-0 tools_4.0.4 future.apply_1.7.0
[126] rstudioapi_0.13 foreach_1.5.1 foreign_0.8-81 gridExtra_2.3 farver_2.1.0
[131] Rtsne_0.15 ggraph_2.0.5 digest_0.6.27 rvcheck_0.1.8 BiocManager_1.30.16
[136] shiny_1.6.0 Rcpp_1.0.6 later_1.2.0 RcppAnnoy_0.0.18 httr_1.4.2
[141] colorspace_2.0-2 XML_3.99-0.6 tensor_1.5 reticulate_1.20 splines_4.0.4
[146] uwot_0.1.10 spatstat.utils_2.2-0 graphlayouts_0.7.1 xlsxjars_0.6.1 plotly_4.9.4.1
[151] xtable_1.8-4 jsonlite_1.7.2 GenomicFiles_1.26.0 tidygraph_1.2.0 R6_2.5.0
[156] Hmisc_4.5-0 pillar_1.6.1 htmltools_0.5.1.1 mime_0.11 glue_1.4.2
[161] fastmap_1.1.0 BiocParallel_1.24.1 codetools_0.2-18 fgsea_1.16.0 utf8_1.2.1
[166] lattice_0.20-44 spatstat.sparse_2.0-0 tibble_3.1.2 curl_4.3.2 rentrez_1.2.3
[171] leiden_0.3.8 GO.db_3.12.1 openssl_1.4.4 survival_3.2-11 rmarkdown_2.9
[176] munsell_0.5.0 DO.db_2.9 GenomeInfoDbData_1.2.4 iterators_1.0.13 reshape2_1.4.4
[181] gtable_0.3.0 spatstat.core_2.2-0 Seurat_4.0.3
Bioconductor; recount quick start guide & workflow; https://bioconductor.org/packages/release/bioc/vignettes/recount/inst/doc/recount-quickstart.html & https://bioconductor.org/packages/release/workflows/vignettes/recountWorkflow/inst/doc/recount-workflow.html
Giulio Pavesi; DE gene analysis with edgeR (2021 Update); http://159.149.160.56/Transcriptomics/edgeR_exercise.html
Bioconductor; Empirical Analysis of Digital Gene Expression Data in R; https://bioconductor.org/packages/release/bioc/html/edgeR.html
Giulio Pavesi; Annotations Exercise; http://159.149.160.56/Transcriptomics/gata6.html
Bioconductor; clusterProfiler; https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
Mohammed Khalfan; Gene Set Enrichment Analysis with ClusterProfiler; https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/